Linear regression is one of the most common methods for examining the linear relationship between a dependent variable \(Y\) and one or more independent variables \(X\). Suppose you want to determine the value of a continuous response variable \(Y\) given the values of \(p \ge 1\) independent explanatory variables \(X_1, X_2, \ldots, X_p\). The overarching model is defined as
\[Y = \beta_0+\beta_1X_1+\beta_2X_2+\cdots+\beta_pX_p+\epsilon,\] where \(\beta_0,\cdots,\beta_p\) are the regression coefficients and we assume independent, normally distributed residuals \(\epsilon \sim N(0,\sigma^2)\) around the mean.
The model to be fitted is given in terms of the mean response, conditional upon a particular realization of the set of explanatory variables \[\hat{y} = E[Y|X_1=x_1,X_2=x_2,\cdots,X_p=x_p]=\hat{\beta}_0+\hat{\beta_1}x_1+\cdots+\hat{\beta}_px_p,\] where \(\hat{\beta}_j\) represent estimates of the regression coefficients.
For the \(n\) data records, the \(\hat{\beta}_j\) are found as the values that minimize the sum
\[\sum^n_{i=1}[y_i-(\hat{\beta}_0+\hat{\beta_1}x_{1,i}+\cdots+\hat{\beta}_px_{p,i})]^2,\] where \(x_{j,i}\) is the observed value of individual \(i\) for explanatory variable \(X_j\) and \(y_i\) is their response value.
The computations involved in minimizing this squared distance are made much easier by a matrix representation of the data. When dealing with \(n\) multivariate observations, you can write the linear model as \[\textbf{Y}=\textbf{X}\cdot\boldsymbol{\beta}+\boldsymbol{\epsilon},\]
where \(\boldsymbol{Y}\) and \(\boldsymbol{\epsilon}\) denote \(n\times 1\) column matrices such that
\[ \mathbf{Y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} \quad \text{and} \quad \boldsymbol{\epsilon} = \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix}. \] Here, \(y_i\) and \(\epsilon_i\) refer to the response observation and random error term for the \(i\)-th individual. The quantity \(\boldsymbol{\beta}\) is a \((p + 1) \times 1\) column matrix of the regression coefficients, and the observed predictor data for all individuals and explanatory variables are stored in an \(n \times (p + 1)\) matrix \(\mathbf{X}\), called the design matrix:
\[ \boldsymbol{\beta} = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{bmatrix} \quad \text{and} \quad \mathbf{X} = \begin{bmatrix} 1 & x_{1,1} & \cdots & x_{1,p} \\ 1 & x_{2,1} & \cdots & x_{2,p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n,1} & \cdots & x_{n,p} \end{bmatrix}. \] Minimizing the sum of squares then yields the estimated regression coefficients in closed form: \[ \hat{\boldsymbol{\beta}} = \begin{bmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \\ \vdots \\ \hat{\beta}_p \end{bmatrix} = \left( \mathbf{X}^{\top} \mathbf{X} \right)^{-1} \mathbf{X}^{\top} \mathbf{Y}. \]
You can manually estimate the \(\beta_j\ (j=0,1,\cdots,p)\) in R. As a quick demonstration, let’s say you have two predictor variables: \(X_1\) continuous and \(X_2\) binary. Your target regression equation is therefore \(\hat{y}=\hat{\beta}_0+\hat{\beta}_1x_1+\hat{\beta}_2x_2\).
Suppose you collect the following data, where the response data, data for \(X_1\), and data for \(X_2\), for \(n = 8\) individuals, are given in the columns \(y\), \(x_1\), and \(x_2\), respectively.
demo.data <- data.frame(y=c(1.55,0.42,1.29,0.73,0.76,-1.09,1.41,-0.32),
x1=c(1.13,-0.73,0.12,0.52,-0.54,-1.15,0.20,-1.09),
x2=c(1,0,1,1,0,1,0,1))
demo.data
## y x1 x2
## 1 1.55 1.13 1
## 2 0.42 -0.73 0
## 3 1.29 0.12 1
## 4 0.73 0.52 1
## 5 0.76 -0.54 0
## 6 -1.09 -1.15 1
## 7 1.41 0.20 0
## 8 -0.32 -1.09 1
Question: Now, try to compute the point estimates in \(\boldsymbol{\beta}\) for the linear model using the equation we derived earlier.
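One way to do this (a minimal sketch, assuming the demo.data frame created above):
X <- cbind(1, demo.data$x1, demo.data$x2)    # design matrix: column of 1s, then the predictors
y <- demo.data$y
beta.hat <- solve(t(X) %*% X) %*% t(X) %*% y # closed-form solution (X'X)^{-1} X'y
beta.hat
coef(lm(y ~ x1 + x2, data = demo.data))      # should match the manual estimates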
Now let’s fit the same kind of model with R’s lm() function. Let’s first get sample data.
#install.packages("car") # if this is a new package, we need to install it first
library(car) # we have to load the package with library() every time we use it
We are using the dataset “Prestige” in the car package for this guide. We can use the help command to access the codebook:
help("Prestige")
View(Prestige)
Question: Can you use the function lm() to describe the relationship between the dependent variable (prestige) and one independent variable (education)? And how can we interpret the model?
Hint: you can use summary() to view the result.
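One possible approach (a sketch):
model.simple <- lm(prestige ~ education, data = Prestige) # fit prestige on education
summary(model.simple) # the education coefficient estimates the change in prestige per additional year of education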
Log transformation:
It is often useful to transform a variable to its log form when doing regression. You can either transform the variable beforehand or do so directly in the model formula.
Note that in some cases the variable contains the value 0, so we cannot directly take its log. In that case we can add 1 to every observation and then do the transformation.
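For example (a sketch; some.var and some.data below are hypothetical names):
lm(prestige ~ log(income), data = Prestige) # transform inside the formula
# if the variable contains zeros, shift by 1 first:
# lm(y ~ log(some.var + 1), data = some.data)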
Question: Using the same data, apply a log transformation to income and fit another linear model relating the response variable ‘prestige’ to the three independent variables ‘education’, ‘income’, and ‘women’. How can we interpret the model now?
Predicted values and Residuals:
Question: Using regression model 2, how can you predict the values of prestige, and how do you obtain the residuals?
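A minimal sketch, assuming model2 is the model with education, log(income), and women from the previous question:
model2 <- lm(prestige ~ education + log(income) + women, data = Prestige)
fitted(model2)    # predicted (fitted) values of prestige for the observed data
residuals(model2) # observed minus fitted values
# predict(model2, newdata = ...) returns predictions for new observations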
Categorical variables play a big part in linear regression too. Note that we only discuss the case where the independent variables are categorical in this guide. When the dependent variable is a categorical variable, you may consider alternatives to linear regression such as logit regression and multinomial regression. In this dataset, type is a categorical variable that describes the type of occupation.
Question: Construct a linear model that describes the relationship between the response variable ‘prestige’ and the three independent variables ‘education’, ‘log(income)’, and ‘type’.
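One possible specification (a sketch; R dummy-codes the factor type automatically):
model3 <- lm(prestige ~ education + log(income) + type, data = Prestige)
summary(model3)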
Sometimes we are interested in the interaction between two variables. Here is an example of an interaction term involving a categorical variable.
Question: We would like to use the same independent variables as before, but now we want to include the interactions between ‘type’ and ‘education’, and between ‘type’ and ‘log(income)’. Can you work out how to construct this linear model?
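A sketch of one way to write it; in an R formula, a*b expands to a + b + a:b, so the main effects are included automatically:
model4 <- lm(prestige ~ education*type + log(income)*type, data = Prestige)
summary(model4)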
It’s a good habit to check the assumptions of linear regression to see whether the model is adequate. We will use the raw variables (without transformation) to construct a model and then test the assumptions. Note that these commands, unless otherwise stated, are from the ‘car’ package.
It’s important to check whether the normality assumption holds for the model.
Now we consider the linear model
model1 <- lm(prestige ~ education + income + type, data = Prestige)
Question: How can we test the normality assumption? Hint: you can use the material from Tutorial 3 (ANOVA) to check normality, with both qqPlot() and shapiro.test().
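A sketch of both checks on model1:
qqPlot(model1)                  # car package: studentized residuals against normal quantiles
shapiro.test(residuals(model1)) # null hypothesis: the residuals are normally distributed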
We can use ncvTest() or leveneTest(), etc., to test for equal variances. We can also consult the residual plots, which we will discuss later, to judge whether the model meets the homoscedasticity assumption.
Question: Please use ncvTest() on model1 and check the result.
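For example:
ncvTest(model1) # null hypothesis: constant error variance; a small p-value suggests heteroscedasticity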
We want to know whether we have too many variables that have high correlation with each other.
“When there are strong linear relationships among the predictors in a regression analysis, the precision of the estimated regression coefficients in linear models declines compared to what it would have been were the predictors uncorrelated with each other” (Fox:359)
Question: Please use vif() on model1 and check the result. A GVIF > 4 suggests collinearity.
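For example:
vif(model1) # with a factor in the model, car reports generalized VIFs (GVIF)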
We can also examine the residual plots for model1.
Question: Please plot the residual plots and comment.
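One way to do this (a sketch) is with car’s residualPlots(), or base R’s plot() method for lm objects:
residualPlots(model1) # residuals against each predictor and against fitted values
# plot(model1)        # base R: residuals vs fitted, QQ plot, scale-location, leverage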
Outlier identification is also important.
We can first look at the QQ plot.
model1 <- lm(prestige ~ education + income + type, data = Prestige)
qqPlot(model1,id.n=2)
## medical.technicians electronic.workers
## 31 82
We can also look at influential/high-leverage points.
#Influence Plots
influencePlot(model1)
## StudRes Hat CookD
## general.managers -1.3134574 0.33504477 0.172503975
## physicians -0.3953204 0.22420309 0.009115491
## medical.technicians 2.8210910 0.06858836 0.109052582
## electronic.workers 2.2251940 0.02701237 0.026372394
In the case of quantitative predictors, we’re more or less comfortable with the interpretation of the linear model coefficient as a “slope” or a “unit increase in outcome per unit increase in the covariate”. This isn’t the right interpretation for factor variables. In particular, the notion of a slope or unit change no longer makes sense when talking about a categorical variable. E.g., what does it even mean to say “unit increase in major” when studying the effect of college major on future earnings?
To understand what the coefficients really mean, let’s go to the
birthwt data and try regressing birthweight on mother’s
race and mother’s age.
library(tidyverse)
# Load data from MASS into a tibble
# Rename variables
# Recode categorical variables
birthwt <- as_tibble(MASS::birthwt) %>%
rename(birthwt.below.2500 = low,
mother.age = age,
mother.weight = lwt,
mother.smokes = smoke,
previous.prem.labor = ptl,
hypertension = ht,
uterine.irr = ui,
physician.visits = ftv,
birthwt.grams = bwt) %>%
mutate(race = recode_factor(race, `1` = "white", `2` = "black", `3` = "other")) %>%
mutate_at(c("mother.smokes", "hypertension", "uterine.irr", "birthwt.below.2500"),
~ recode_factor(.x, `0` = "no", `1` = "yes"))
# Fit regression model
birthwt.lm <- lm(birthwt.grams ~ race + mother.age, data = birthwt)
# Regression model summary
summary(birthwt.lm)
##
## Call:
## lm(formula = birthwt.grams ~ race + mother.age, data = birthwt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2131.57 -488.02 -1.16 521.87 1757.07
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2949.979 255.352 11.553 <2e-16 ***
## raceblack -365.715 160.636 -2.277 0.0240 *
## raceother -285.466 115.531 -2.471 0.0144 *
## mother.age 6.288 10.073 0.624 0.5332
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 715.7 on 185 degrees of freedom
## Multiple R-squared: 0.05217, Adjusted R-squared: 0.0368
## F-statistic: 3.394 on 3 and 185 DF, p-value: 0.01909
Note that there are two coefficients estimated for the race variable (raceblack and raceother). What’s happening here?
When you put a factor variable into a regression, you’re allowing a different intercept at every level of the factor. In the present example, you’re saying that you want to model birthwt.grams as
\[ y = \begin{cases} \mathrm{Intercept}_{white} + \beta \times age & \text{among white mothers} \\ \mathrm{Intercept}_{black} + \beta \times age & \text{among black mothers} \\ \mathrm{Intercept}_{other} + \beta \times age & \text{among mothers of other races.} \end{cases} \]
We can rewrite this more succinctly as: \[ y=\mathrm{Intercept}_{race}+\beta\times age. \] Essentially you’re saying that your data is broken down into 3 racial groups, and you want to model your data as having the same slope governing how birthweight changes with mother’s age, but potentially different intercepts. Here’s a picture of what’s happening.
# Calculate race-specific intercepts
intercepts <- c(coef(birthwt.lm)["(Intercept)"],
coef(birthwt.lm)["(Intercept)"] + coef(birthwt.lm)["raceblack"],
coef(birthwt.lm)["(Intercept)"] + coef(birthwt.lm)["raceother"])
lines.df <- data.frame(intercepts = intercepts,
slopes = rep(coef(birthwt.lm)["mother.age"], 3),
race = levels(birthwt$race))
ggplot(mapping = aes(x = mother.age, y = birthwt.grams, color = race), data = birthwt) +
geom_point() +
geom_abline(aes(intercept = intercepts,
slope = slopes,
color = race), data = lines.df)
How do we interpret the 2 race coefficients? For categorical variables, the interpretation is relative to the given baseline. The baseline is just whatever level comes first (here, “white”). E.g., the estimate of raceother means that the estimated intercept is 285.47 lower among “other” race mothers compared to white mothers. Similarly, the estimated intercept is 365.72 lower for black mothers than for white mothers.
Another way of putting it: Among mothers of the same age, babies of black mothers are born weighing on average 365.72g less than babies of white mothers.
As you’ve already noticed, there is no coefficient called “racewhite” in the estimated model. This is because this coefficient gets absorbed into the overall (Intercept) term.
Let’s peek under the hood. Using the model.matrix()
function on our linear model object, we can get the data matrix that
underlies our regression. Here are the first 20 rows.
head(model.matrix(birthwt.lm), 20)
## (Intercept) raceblack raceother mother.age
## 1 1 1 0 19
## 2 1 0 1 33
## 3 1 0 0 20
## 4 1 0 0 21
## 5 1 0 0 18
## 6 1 0 1 21
## 7 1 0 0 22
## 8 1 0 1 17
## 9 1 0 0 29
## 10 1 0 0 26
## 11 1 0 1 19
## 12 1 0 1 19
## 13 1 0 1 22
## 14 1 0 1 30
## 15 1 0 0 18
## 16 1 0 0 18
## 17 1 1 0 15
## 18 1 0 0 25
## 19 1 0 1 20
## 20 1 0 0 28
Even though we think of the regression
birthwt.grams ~ race + mother.age as being a regression on
two variables (and an intercept), it’s actually a regression on 3
variables (and an intercept). This is because the race
variable gets represented as two dummy variables: one for
race == other and the other for
race == black.
Why isn’t there a column representing the indicator of race == white? This gets back to our collinearity issue. By definition, we have that
\[ \texttt{racewhite} + \texttt{raceblack} + \texttt{raceother} = 1 = \texttt{(Intercept)}. \]
This is because for every observation, one and only one of the race dummy variables equals 1. Thus the group of 4 variables {racewhite, raceother, raceblack, (Intercept)} is perfectly collinear, and we can’t include all 4 of them in the model. The default behavior in R is to remove the dummy corresponding to the first level of the factor (here, racewhite), and to keep the rest.
Let’s go back to the regression line plot we generated above.
ggplot(mapping = aes(x = mother.age, y = birthwt.grams, color = race), data = birthwt) +
geom_point() +
geom_abline(aes(intercept = intercepts,
slope = slopes,
color = race), data = lines.df)
We can have similar plots by using the geom_smooth or
stat_smooth commands in ggplot. Compare the
plot above to the following.
ggplot(mapping = aes(x = mother.age, y = birthwt.grams, color = race), data = birthwt) +
geom_point() +
stat_smooth(method = "lm", se = FALSE, fullrange = TRUE)
In this case we have not only race-specific intercepts, but also race-specific slopes. The plot above corresponds to the model:
\[ y = \begin{cases} \mathrm{Intercept}_{white} + \beta_{white} \times age & \text{among white mothers} \\ \mathrm{Intercept}_{black} + \beta_{black} \times age & \text{among black mothers} \\ \mathrm{Intercept}_{other} + \beta_{other} \times age & \text{among mothers of other races.} \end{cases} \]
We can rewrite this more succinctly as: \[ y=\mathrm{Intercept}_{race}+\beta_{race}\times age. \] To specify this interaction model in R, we use the following syntax
birthwt.lm.interact <- lm(birthwt.grams ~ race * mother.age, data = birthwt)
summary(birthwt.lm.interact)
##
## Call:
## lm(formula = birthwt.grams ~ race * mother.age, data = birthwt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2182.35 -474.23 13.48 523.86 1496.51
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2583.54 321.52 8.035 1.11e-13 ***
## raceblack 1022.79 694.21 1.473 0.1424
## raceother 326.05 545.30 0.598 0.5506
## mother.age 21.37 12.89 1.658 0.0991 .
## raceblack:mother.age -62.54 30.67 -2.039 0.0429 *
## raceother:mother.age -26.03 23.20 -1.122 0.2633
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 710.7 on 183 degrees of freedom
## Multiple R-squared: 0.07541, Adjusted R-squared: 0.05015
## F-statistic: 2.985 on 5 and 183 DF, p-value: 0.01291
We now have new terms appearing. Terms like raceblack:mother.age are deviations from the baseline slope (the coefficient of mother.age in the model) in the same way that terms like raceblack are deviations from the baseline intercept. This model says that:
On average among white mothers, every additional year of age is associated with a 21.37g increase in the birthweight of the baby.
To get the slope for black mothers, we need to add the interaction term to the baseline slope: \[ \text{slope}_{\text{black}} = \beta_{\texttt{mother.age}} + \beta_{\texttt{raceblack:mother.age}} = 21.3727574 + (-62.5379308) = -41.1651734. \] This slope estimate is negative, which agrees with the regression plot above.
Last class we considered modelling birthweight as a linear function of mother’s age, allowing for a race-specific intercept for each of the three race categories. This model is fit again below.
birthwt.lm <- lm(birthwt.grams ~ race + mother.age, data = birthwt)
Here’s a visualization of the model fit that we wound up with. Note
that while there are 3 lines shown, this is a visualization of just one
model: birthwt.grams ~ race + mother.age. This model
produces 3 lines because the coefficients of the race
variable result in different intercepts.
# Calculate race-specific intercepts
intercepts <- c(coef(birthwt.lm)["(Intercept)"],
coef(birthwt.lm)["(Intercept)"] + coef(birthwt.lm)["raceblack"],
coef(birthwt.lm)["(Intercept)"] + coef(birthwt.lm)["raceother"])
lines.df <- data.frame(intercepts = intercepts,
slopes = rep(coef(birthwt.lm)["mother.age"], 3),
race = levels(birthwt$race))
ggplot(mapping = aes(x = mother.age, y = birthwt.grams, color = race), data = birthwt) +
geom_point() +
geom_abline(aes(intercept = intercepts,
slope = slopes,
color = race), data = lines.df)
At this stage we may be interested in assessing whether the race variable is statistically significant. That is, does including the race variable significantly improve the fit of our model, or is the simpler model birthwt.grams ~ mother.age just as good?
Essentially, we want to know if the race-specific intercepts capture significantly more variation in the outcome (birthweight) than the single intercept model, or if allowing for different intercepts isn’t doing much more than capturing random fluctuations in the data.
Here’s a picture of the two models we’re comparing:
library(gridExtra)
plot.complex <- ggplot(mapping = aes(x = mother.age, y = birthwt.grams, color = race), data = birthwt) +
geom_point() +
geom_abline(aes(intercept = intercepts,
slope = slopes,
color = race), data = lines.df)
# Single intercept model (birthwt.grams ~ mother.age)
p <- ggplot(birthwt, aes(x = mother.age, y = birthwt.grams))
plot.simple <- p + geom_point(aes(colour = race)) + stat_smooth(method = "lm")
grid.arrange(plot.complex, plot.simple, ncol = 2)
To test this hypothesis, we use the anova function (not to be confused with the aov function). This function compares two nested models, accounting for their residual sums of squares (how well they fit the data) and their complexity (how many more variables are in the larger model) to assess statistical significance.
# Fit the simpler model with mother.age as the only predictor
birthwt.lm.simple <- lm(birthwt.grams ~ mother.age, data = birthwt)
# Compare to more complex model
anova(birthwt.lm.simple, birthwt.lm)
## Analysis of Variance Table
##
## Model 1: birthwt.grams ~ mother.age
## Model 2: birthwt.grams ~ race + mother.age
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 187 99154173
## 2 185 94754346 2 4399826 4.2951 0.01502 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This output tells us that the race variable is statistically significant: It is unlikely that the improvement in fit when we add the race variable is simply due to random fluctuations in the data. Thus it is important to consider race when modeling how birthweight depends on the mother’s age.
Assessing significance of interaction terms operates on the same principle. We once again ask whether the improvement in model fit is worth the increased complexity of our model. For instance, consider the example we saw last class, where we allowed for a race-specific slope in addition to the race-specific intercept from before.
birthwt.lm.interact <- lm(birthwt.grams ~ race * mother.age, data = birthwt)
summary(birthwt.lm.interact)
##
## Call:
## lm(formula = birthwt.grams ~ race * mother.age, data = birthwt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2182.35 -474.23 13.48 523.86 1496.51
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2583.54 321.52 8.035 1.11e-13 ***
## raceblack 1022.79 694.21 1.473 0.1424
## raceother 326.05 545.30 0.598 0.5506
## mother.age 21.37 12.89 1.658 0.0991 .
## raceblack:mother.age -62.54 30.67 -2.039 0.0429 *
## raceother:mother.age -26.03 23.20 -1.122 0.2633
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 710.7 on 183 degrees of freedom
## Multiple R-squared: 0.07541, Adjusted R-squared: 0.05015
## F-statistic: 2.985 on 5 and 183 DF, p-value: 0.01291
Here’s a side-by-side visual comparison of the race + mother.age model and the race + mother.age + race*mother.age interaction model.
plot.interact <- ggplot(mapping = aes(x = mother.age, y = birthwt.grams, color = race), data = birthwt) +
geom_point() +
stat_smooth(method = "lm", se = FALSE, fullrange = TRUE)
grid.arrange(plot.interact, plot.complex, ncol = 2)
So, do the lines with different slopes fit the data significantly
better than the common slope model? Let’s compare the two with the
anova() function.
anova(birthwt.lm, birthwt.lm.interact)
## Analysis of Variance Table
##
## Model 1: birthwt.grams ~ race + mother.age
## Model 2: birthwt.grams ~ race * mother.age
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 185 94754346
## 2 183 92431148 2 2323199 2.2998 0.1032
This \(p\)-value turns out to not be
statistically significant. So even though the estimated slopes in the
interaction model look very different, our estimates are quite variable,
so we don’t have enough evidence to conclude that the interaction term
(different slopes) is providing significant additional explanatory power
over the simpler race + mother.age model.
The testing strategy above applies to any two nested models. Here’s
an example where we add in a few more variables and see how it compares
to the race + mother.age model from earlier.
birthwt.lm.complex <- lm(birthwt.grams ~ mother.smokes + physician.visits + race + mother.age, data = birthwt)
summary(birthwt.lm.complex)
##
## Call:
## lm(formula = birthwt.grams ~ mother.smokes + physician.visits +
## race + mother.age, data = birthwt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2335.06 -455.16 31.74 499.29 1623.57
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3282.407 261.326 12.561 < 2e-16 ***
## mother.smokesyes -424.651 110.371 -3.847 0.000165 ***
## physician.visits 14.391 48.953 0.294 0.769102
## raceblack -444.340 156.586 -2.838 0.005057 **
## raceother -445.161 119.666 -3.720 0.000265 ***
## mother.age 1.547 9.996 0.155 0.877155
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 691.7 on 183 degrees of freedom
## Multiple R-squared: 0.1241, Adjusted R-squared: 0.1001
## F-statistic: 5.184 on 5 and 183 DF, p-value: 0.000179
Let’s compare to our earlier model:
anova(birthwt.lm, birthwt.lm.complex)
## Analysis of Variance Table
##
## Model 1: birthwt.grams ~ race + mother.age
## Model 2: birthwt.grams ~ mother.smokes + physician.visits + race + mother.age
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 185 94754346
## 2 183 87567280 2 7187067 7.5098 0.0007336 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Highly significant! This is probably due to the fact that mother’s smoking status has a very strong association with birthweight.
(gapminder <- read_delim("http://www.andrew.cmu.edu/user/achoulde/94842/data/gapminder_five_year.txt",
delim = "\t")) # Load data
## # A tibble: 1,704 × 6
## country year pop continent lifeExp gdpPercap
## <chr> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 Afghanistan 1952 8425333 Asia 28.8 779.
## 2 Afghanistan 1957 9240934 Asia 30.3 821.
## 3 Afghanistan 1962 10267083 Asia 32.0 853.
## 4 Afghanistan 1967 11537966 Asia 34.0 836.
## 5 Afghanistan 1972 13079460 Asia 36.1 740.
## 6 Afghanistan 1977 14880372 Asia 38.4 786.
## 7 Afghanistan 1982 12881816 Asia 39.9 978.
## 8 Afghanistan 1987 13867957 Asia 40.8 852.
## 9 Afghanistan 1992 16317921 Asia 41.7 649.
## 10 Afghanistan 1997 22227415 Asia 41.8 635.
## # ℹ 1,694 more rows
Before diving into a regression, let’s look at some data summaries that use some familiar functions. Here’s how we can form a table showing the maximum life expectancy on each continent each year, and the country that attained that maximum.
gapminder %>%
group_by(continent, year) %>%
summarize(max.life.exp = max(lifeExp),
country = country[which.max(lifeExp)])
## # A tibble: 60 × 4
## # Groups: continent [5]
## continent year max.life.exp country
## <chr> <dbl> <dbl> <chr>
## 1 Africa 1952 52.7 Reunion
## 2 Africa 1957 58.1 Mauritius
## 3 Africa 1962 60.2 Mauritius
## 4 Africa 1967 61.6 Mauritius
## 5 Africa 1972 64.3 Reunion
## 6 Africa 1977 67.1 Reunion
## 7 Africa 1982 69.9 Reunion
## 8 Africa 1987 71.9 Reunion
## 9 Africa 1992 73.6 Reunion
## 10 Africa 1997 74.8 Reunion
## # ℹ 50 more rows
We’re now going to go through an example where we get the life
expectancy in 1952 and the rate of change in life expectancy over time
for each country. The rate of change will be obtained by regressing
lifeExp on year.
Let’s start with the data for a single country.
country.name <- "Ireland" # Pick a country
gapminder.sub <- filter(gapminder, country == country.name) # Pull data for this country
gapminder.sub
## # A tibble: 12 × 6
## country year pop continent lifeExp gdpPercap
## <chr> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 Ireland 1952 2952156 Europe 66.9 5210.
## 2 Ireland 1957 2878220 Europe 68.9 5599.
## 3 Ireland 1962 2830000 Europe 70.3 6632.
## 4 Ireland 1967 2900100 Europe 71.1 7656.
## 5 Ireland 1972 3024400 Europe 71.3 9531.
## 6 Ireland 1977 3271900 Europe 72.0 11151.
## 7 Ireland 1982 3480000 Europe 73.1 12618.
## 8 Ireland 1987 3539900 Europe 74.4 13873.
## 9 Ireland 1992 3557761 Europe 75.5 17559.
## 10 Ireland 1997 3667233 Europe 76.1 24522.
## 11 Ireland 2002 3879155 Europe 77.8 34077.
## 12 Ireland 2007 4109086 Europe 78.9 40676.
# Scatterplot of life exp vs year
# with a regression line overlaid
ggplot(data = gapminder.sub, mapping = aes(x = year, y = lifeExp)) +
geom_point() +
geom_smooth(method = "lm") +
ggtitle(paste("Life expectancy in", country.name))
We can confirm that it’s a pretty good model for other countries as well, though not for all of them:
ggplot(data = gapminder, aes(x = year, y = lifeExp)) +
facet_wrap( ~ country) +
geom_point() +
geom_smooth(method = "lm")
Now let’s fit a regression and extract the slope.
life.exp.lm <- lm(lifeExp ~ year, data = gapminder.sub) # Fit model
coef(life.exp.lm) # Get coefficients
## (Intercept) year
## -321.1399594 0.1991196
coef(life.exp.lm)["year"] # The slope that we wanted
## year
## 0.1991196
coef(lm(lifeExp ~ year, data = gapminder.sub))["year"]
## year
## 0.1991196
Here we’ll extract the slope in the way we practiced above, and we’ll also extract the “origin” life expectancy: the given country’s life expectancy in 1952, the first year of our data. For the purpose of plotting we’ll also want the continent information, so we’ll capture that in this call too.
progress.df <- gapminder %>%
group_by(country) %>%
summarize(continent = continent[1],
origin = lifeExp[year == 1952],
slope = lm(lifeExp ~ year)$coef["year"])
progress.df
## # A tibble: 142 × 4
## country continent origin slope
## <chr> <chr> <dbl> <dbl>
## 1 Afghanistan Asia 28.8 0.275
## 2 Albania Europe 55.2 0.335
## 3 Algeria Africa 43.1 0.569
## 4 Angola Africa 30.0 0.209
## 5 Argentina Americas 62.5 0.232
## 6 Australia Oceania 69.1 0.228
## 7 Austria Europe 66.8 0.242
## 8 Bahrain Asia 50.9 0.468
## 9 Bangladesh Asia 37.5 0.498
## 10 Belgium Europe 68 0.209
## # ℹ 132 more rows
Let’s summarize our findings by creating a bar chart for the origin and slope, with the bars colored by continent. We’ll do this in ggplot.
# Reorder country factor by origin
# Construct bar chart
progress.df %>%
mutate(country = reorder(country, origin)) %>%
ggplot(aes(x = country, y = origin, fill = continent)) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 60, vjust = 1, hjust = 1))
# Reorder country factor by slope
# Construct bar chart
progress.df %>%
mutate(country = reorder(country, slope)) %>%
ggplot(aes(x = country, y = slope, fill = continent)) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 60, vjust = 1, hjust = 1))
These are very interesting plots. What can you tell from looking at them?
Let’s start by looking at some plots of how GDP per capita varied by year:
# qplot(year, gdpPercap, facets = ~ country, data = gapminder, colour = continent) +
ggplot(data = gapminder, mapping = aes(x = year, y = gdpPercap, colour = continent)) +
geom_point() +
facet_wrap(~ country, ncol=12) +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
What if we want to rearrange the plots by continent? This can be done
by changing the order of the country level.
# First step: reorder the countries by continent
# Produce a data frame that has the countries ordered alphabetically within continent
# Arrange sorts the data according to the variable(s) provided
country.df <- gapminder %>%
group_by(country) %>%
summarize(continent = continent[1]) %>%
arrange(continent)
gapminder.ordered <- gapminder %>%
mutate(country = factor(country, levels = country.df$country))
# Let's make sure that things are now ordered correctly...
levels(gapminder.ordered$country)
## [1] "Algeria" "Angola"
## [3] "Benin" "Botswana"
## [5] "Burkina Faso" "Burundi"
## [7] "Cameroon" "Central African Republic"
## [9] "Chad" "Comoros"
## [11] "Congo, Dem. Rep." "Congo, Rep."
## [13] "Cote d'Ivoire" "Djibouti"
## [15] "Egypt" "Equatorial Guinea"
## [17] "Eritrea" "Ethiopia"
## [19] "Gabon" "Gambia"
## [21] "Ghana" "Guinea"
## [23] "Guinea-Bissau" "Kenya"
## [25] "Lesotho" "Liberia"
## [27] "Libya" "Madagascar"
## [29] "Malawi" "Mali"
## [31] "Mauritania" "Mauritius"
## [33] "Morocco" "Mozambique"
## [35] "Namibia" "Niger"
## [37] "Nigeria" "Reunion"
## [39] "Rwanda" "Sao Tome and Principe"
## [41] "Senegal" "Sierra Leone"
## [43] "Somalia" "South Africa"
## [45] "Sudan" "Swaziland"
## [47] "Tanzania" "Togo"
## [49] "Tunisia" "Uganda"
## [51] "Zambia" "Zimbabwe"
## [53] "Argentina" "Bolivia"
## [55] "Brazil" "Canada"
## [57] "Chile" "Colombia"
## [59] "Costa Rica" "Cuba"
## [61] "Dominican Republic" "Ecuador"
## [63] "El Salvador" "Guatemala"
## [65] "Haiti" "Honduras"
## [67] "Jamaica" "Mexico"
## [69] "Nicaragua" "Panama"
## [71] "Paraguay" "Peru"
## [73] "Puerto Rico" "Trinidad and Tobago"
## [75] "United States" "Uruguay"
## [77] "Venezuela" "Afghanistan"
## [79] "Bahrain" "Bangladesh"
## [81] "Cambodia" "China"
## [83] "Hong Kong, China" "India"
## [85] "Indonesia" "Iran"
## [87] "Iraq" "Israel"
## [89] "Japan" "Jordan"
## [91] "Korea, Dem. Rep." "Korea, Rep."
## [93] "Kuwait" "Lebanon"
## [95] "Malaysia" "Mongolia"
## [97] "Myanmar" "Nepal"
## [99] "Oman" "Pakistan"
## [101] "Philippines" "Saudi Arabia"
## [103] "Singapore" "Sri Lanka"
## [105] "Syria" "Taiwan"
## [107] "Thailand" "Vietnam"
## [109] "West Bank and Gaza" "Yemen, Rep."
## [111] "Albania" "Austria"
## [113] "Belgium" "Bosnia and Herzegovina"
## [115] "Bulgaria" "Croatia"
## [117] "Czech Republic" "Denmark"
## [119] "Finland" "France"
## [121] "Germany" "Greece"
## [123] "Hungary" "Iceland"
## [125] "Ireland" "Italy"
## [127] "Montenegro" "Netherlands"
## [129] "Norway" "Poland"
## [131] "Portugal" "Romania"
## [133] "Serbia" "Slovak Republic"
## [135] "Slovenia" "Spain"
## [137] "Sweden" "Switzerland"
## [139] "Turkey" "United Kingdom"
## [141] "Australia" "New Zealand"
ggplot(data = gapminder.ordered, mapping = aes(x = year, y = gdpPercap, colour = continent)) +
geom_point() +
facet_wrap(~ country, ncol=12) +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
stat_smooth(method = "lm")
ANCOVA in R, or Analysis of Covariance, is a powerful statistical method that combines aspects of analysis of variance (ANOVA) and regression analysis. It is employed to assess group mean differences while taking into account the influence of continuous covariates. ANCOVA is particularly valuable when researchers need to control for variables that might confound the interpretation of their results. This technique allows us to determine whether there are significant differences in the means of dependent variables among various groups, even when considering the potential impact of covariates.
The main difference between ANOVA (Analysis of Variance) and ANCOVA (Analysis of Covariance) lies in how they handle and incorporate variables, as well as their primary objectives:
1. Purpose
ANOVA: Compares the means of two or more groups to determine if there are statistically significant differences between them.
ANCOVA: Compares the means of two or more groups while controlling for the effects of one or more continuous covariates (i.e., control variables).
2. Variables Involved
ANOVA:
Independent variables (factors): Categorical variables that define the groups.
Dependent variable: Continuous outcome variable being compared across groups.
ANCOVA:
Independent variables (factors): Categorical variables that define the groups.
Dependent variable: Continuous outcome variable being compared across groups.
Covariates: Continuous variables that are included to control for their effects, improving the accuracy of group comparisons.
3. Example Scenarios
ANOVA: You want to test if different teaching methods (Group A, B, C) lead to different average test scores.
ANCOVA: You want to test if different teaching methods (Group A, B, C) lead to different average test scores while controlling for students’ prior knowledge (e.g., pre-test scores).
The ANOVA model can be expressed as:
\[ Y_{ij} = \mu + \alpha_i + \epsilon_{ij} \]
where \(Y_{ij}\) is the dependent variable (response), \(\mu\) is the overall mean, \(\alpha_i\) is the effect of group \(i\), and \(\epsilon_{ij}\) is the random error.
The ANCOVA model adds a covariate to the equation:
\[ Y_{ij} = \mu + \alpha_i + \beta X_{ij} + \epsilon_{ij} \]
where \(X_{ij}\) is the covariate and \(\beta\) is the regression coefficient for the covariate. The other terms (\(Y_{ij}\), \(\mu\), \(\alpha_i\), and \(\epsilon_{ij}\)) are as defined in the ANOVA model.
ANCOVA makes several assumptions about the data, such as:
Linearity between the covariate and the outcome variable at each level of the grouping variable. This can be checked by creating a grouped scatter plot of the covariate and the outcome variable.
Homogeneity of regression slopes. The slopes of the regression lines formed by the covariate and the outcome variable should be the same for each group. This assumption requires that there is no interaction between the outcome and the covariate; the regression lines plotted by group should be parallel.
The outcome variable should be approximately normally distributed. This can be checked using the Shapiro-Wilk test of normality on the model residuals.
Homoscedasticity, or homogeneity of residual variance for all groups. The residuals are assumed to have a constant variance.
No significant outliers in the groups.
Related libraries
library(tidyverse)
library(ggpubr)
library(rstatix)
library(broom)
library(datarium)
#make this example reproducible
set.seed(10)
#create dataset
data <- data.frame(technique = rep(c("A", "B", "C"), each = 30),
current_grade = runif(90, 65, 95),
exam = c(runif(30, 80, 95), runif(30, 70, 95), runif(30, 70, 90)))
head(data)
## technique current_grade exam
## 1 A 80.22435 87.32759
## 2 A 74.20306 90.67114
## 3 A 77.80723 88.87902
## 4 A 85.79306 87.75735
## 5 A 67.55408 85.72442
## 6 A 71.76310 92.52167
Explore the data
Before we fit the ANCOVA model, we should first explore the data to gain a better understanding of it and verify that there aren’t any extreme outliers that could skew the results.
First, we can view a summary of each variable in the dataset:
summary(data)
## technique current_grade exam
## Length:90 Min. :65.43 Min. :71.17
## Class :character 1st Qu.:71.79 1st Qu.:77.27
## Mode :character Median :77.84 Median :84.69
## Mean :78.15 Mean :83.38
## 3rd Qu.:83.65 3rd Qu.:89.22
## Max. :93.84 Max. :94.76
Next, we can use the dplyr package to easily find the mean and the standard deviation of both the current grades and the exam scores for each studying technique:
#load dplyr
library(dplyr)
data %>%
group_by(technique) %>%
summarise(mean_grade = mean(current_grade),
sd_grade = sd(current_grade),
mean_exam = mean(exam),
sd_exam = sd(exam))
## # A tibble: 3 × 5
## technique mean_grade sd_grade mean_exam sd_exam
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 A 79.0 7.00 88.5 3.88
## 2 B 78.5 8.33 81.8 7.62
## 3 C 76.9 8.24 79.9 5.71
Please plot a boxplot showing the relationship between exam scores and studying technique.
# ?boxplot
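A minimal sketch using base R graphics:
boxplot(exam ~ technique, data = data,
        main = "Exam score by studying technique",
        xlab = "Technique", ylab = "Exam score")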
Similarly, we can also use boxplots to visualize the distribution of current grades based on studying technique:
# ?boxplot
Next, we’ll fit the ANCOVA model using exam score as the response variable, studying technique as the predictor (or “treatment”) variable, and current grade as the covariate. We start with the complete model, which includes the interaction between the variables.
We’ll use the Anova() function in the car package so that we can specify type III sums of squares for the model, since type I sums of squares depend on the order in which the predictors are entered into the model.
library(car)
# ?lm
# ?Anova
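A sketch of the full model, including the interaction, with type III sums of squares:
ancova.full <- lm(exam ~ technique * current_grade, data = data) # interaction model
Anova(ancova.full, type = "III") # type III tests, including technique:current_grade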
Now let’s fit a simpler model without interaction.
# ?lm
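And the simpler, additive model (a sketch):
ancova.simple <- lm(exam ~ technique + current_grade, data = data) # no interaction
Anova(ancova.simple, type = "III")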
Although the ANCOVA results told us that studying technique had a statistically significant effect on exam scores, we need to run post hoc tests to actually find out which studying techniques differ from each other.
To do so, we can use the glht() function within the multcomp package in R to perform Tukey’s Test for multiple comparisons:
library(multcomp)
#fit the ANCOVA model
# notice here you need to make technique a factor variable first, otherwise you will get an error
data$technique <- as.factor(data$technique)
# ?glht
# for the glht function you will use, set: linfct = mcp(technique = "Tukey")
# after fitting the general linear hypotheses, use summary() to take a look at the result
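Putting the hints together, one possible sketch:
ancova.simple <- lm(exam ~ technique + current_grade, data = data)  # refit with technique as a factor
posthoc <- glht(ancova.simple, linfct = mcp(technique = "Tukey"))   # Tukey pairwise comparisons
summary(posthoc)                                                    # comparisons B - A, C - A, and C - B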
What if you would like to look at the confidence interval of the mean difference between teaching techniques? Use the function confint() here.
#view the confidence intervals associated with the multiple comparisons
# use confint() on the glht object you just created
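For example, on the posthoc object from the sketch above:
confint(posthoc) # confidence intervals for each pairwise mean difference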